Exposition and Analysis of a Suffix Sorting Algorithm

Author

SIMON J. PUGLISI

Abstract

This paper focuses on the suffix sorting algorithm of Maniscalco [10], which at the time of writing is available only as C++ source code on the Internet. We will refer to the program as MSufSort. MSufSort computes the Inverse Suffix Array (ISA) of an input string, which is equivalent to computing the Suffix Array (converting one to the other is discussed in Section 8). Recall that for i ∈ [0..n − 1], ISA[i] gives the lexicographic rank of the suffix x[i..n − 1] amongst all the suffixes of the string x[0..n − 1]. Experiments summarized in [10] suggest that MSufSort outperforms the fastest known suffix sorting programs, while using little extra space beyond the 4n bytes that hold the suffix array and the n bytes for the input string (in the terminology of [11], it would be called lightweight). It is also purported to perform well on periodic strings, which are known to be catastrophic worst cases for some algorithms.

This paper addresses the need for a more formal examination of what appears to be a very robust suffix sorter. We examine and describe the inner workings of the algorithm, and try to explain why MSufSort performs well by analyzing its asymptotic behavior. As published in [10], the MSufSort source code spans several classes and files and is not easy to absorb in a single sitting. The code presented in this paper constitutes a complete rewrite of the original as just a few C functions, and is included not as an optimization, but rather to aid explanation of the approach.

After introducing some notation in Section 2, the basic algorithm is described in Section 3 before two powerful heuristics are introduced in Sections 4 and 5. Sections 6 and 7 consider time and space usage. Section 8 discusses ISA to SA transformation. In Section 9 we extensively test MSufSort and compare its performance to that of other leading suffix sorters. Possible areas for future work are outlined in Sections 10 and 11, and brief conclusions are offered in Section 12.
It is assumed the reader is familiar with the concept of suffix sorting and its applications, particularly the Burrows-Wheeler Transformation (BWT).
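
To make the ISA definition above concrete, here is a minimal Python sketch. It is not MSufSort: it builds the suffix array by naively sorting suffixes, which is far slower than the algorithm the paper analyzes, and then inverts it. The function names `naive_suffix_array` and `invert` are illustrative choices and do not come from the paper or from the MSufSort source.

```python
def naive_suffix_array(x):
    """Return SA for string x by directly sorting suffix start positions.

    Illustrative only: comparing whole suffixes makes this quadratic or
    worse, unlike the suffix sorter discussed in the paper.
    """
    return sorted(range(len(x)), key=lambda i: x[i:])


def invert(perm):
    """Invert a permutation: if perm[r] == i then the result maps i to r.

    SA and ISA are inverse permutations of each other, so the same
    routine converts SA to ISA and ISA back to SA.
    """
    inv = [0] * len(perm)
    for rank, pos in enumerate(perm):
        inv[pos] = rank
    return inv


x = "banana"
sa = naive_suffix_array(x)   # start positions in lexicographic order
isa = invert(sa)             # isa[i] = rank of suffix x[i:]

for i in range(len(x)):
    print(i, isa[i], x[i:])
```

For "banana" the suffix "a" at position 5 is smallest, so ISA[5] = 0, while "nana" at position 2 is largest, so ISA[2] = 5. Applying `invert` to the ISA recovers the SA, which is the sense in which computing one is equivalent to computing the other.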

Similar resources

Linear-time Suffix Sorting - A New Approach for Suffix Array Construction

This thesis presents a new approach for linear-time suffix sorting. It introduces a new sorting principle that can be used to build the first non-recursive linear-time suffix array construction algorithm, named GSACA. Although GSACA cannot keep up with the performance of state-of-the-art suffix array construction algorithms, the algorithm introduces a lot of new ideas for suffix array constructi...

An Algorithm for Suffix Sorting and Its Applications

The suffix tree is a data structure that has found applications in various important problems, such as genetic sequencing, pattern matching and computational biology. Its derivative data structure, the suffix array, is another representation with the added advantage of a small memory footprint. We propose a simple O(n log n) time divide-and-conquer sort-and-merge algorithm for solving the suffix...

Parallel Suffix Sorting

We present a parallel algorithm for lexicographically sorting the suffixes of a string. Suffix sorting has applications in string processing, data compression and computational biology. The ordered list of suffixes of a string stored in an array is known as Suffix Array, an important data structure in string processing and computational biology. Our focus is on deriving a practical implementati...

Notes on Suffix Sorting

We study the problem of lexicographically sorting the suffixes of a string of symbols. In particular, we analyze the time complexity of Sadakane’s suffix sorting algorithm [8], showing that this is O(n log n) in the worst case. We also give a small improvement in the space requirements of this algorithm. We conclude that Sadakane’s algorithm, which has previously been shown to outperform the cl...

Improving the Speed of LZ77 Compression by Hashing and Suffix Sorting

Two new algorithms for improving the speed of the LZ77 compression are proposed. One is based on a new hashing algorithm named two-level hashing that enables fast longest match searching from a sliding dictionary, and the other uses suffix sorting. The former is suitable for small dictionaries and it significantly improves the speed of gzip, which uses a naive hashing algorithm. The latter is s...

Publication date: 2005
